Embeddings and text representations
Embeddings are dense, low-dimensional vector representations of data, such as words, images, or entities, that capture their semantic or contextual meaning in a continuous space. They are used to convert high-dimensional or categorical data into a format suitable for computational models, enabling efficient similarity comparisons and feature extraction.
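A minimal sketch of the "efficient similarity comparison" idea, using cosine similarity between two illustrative embedding vectors (the vectors and their values are invented for illustration; real models use hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative 4-dimensional "embeddings"
cat = np.array([0.8, 0.1, 0.3, 0.0])
kitten = np.array([0.7, 0.2, 0.4, 0.1])
car = np.array([0.0, 0.9, 0.0, 0.8])

print(cosine_similarity(cat, kitten))  # high: semantically related
print(cosine_similarity(cat, car))     # lower: unrelated concepts
```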
Resources
- Text Embeddings: Comprehensive Guide | by Mariya Mansurova | Towards Data Science
- An intuitive introduction to text embeddings - Stack Overflow
Bag of words
- A commonly used model for text classification. In the BoW model, a piece of text (a sentence or a document) is represented as a bag (multiset) of its words, disregarding grammar and word order; the frequency of occurrence of each word is used as a feature for training a classifier.
- BoW is different from Word2vec, which we'll cover next. The main difference is that Word2vec produces one dense vector per word, whereas BoW produces one sparse count vector per document (one count per vocabulary word). Word2vec is great for digging into documents and identifying content and subsets of content, since its vectors represent each word's context (the n-grams of which it is a part). BoW is good for classifying documents as a whole. A minimal example is sketched below.
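A minimal BoW sketch using scikit-learn's CountVectorizer (the toy corpus is illustrative, not taken from the sources above):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: each document becomes one row of word counts
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix, shape (n_docs, vocab_size)

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(X.toarray())                         # word counts per document; word order is lost
```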
TF–IDF
- https://en.wikipedia.org/wiki/Tf–idf - Term Frequency-Inverse Document Frequency
- Numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
- The tf-idf value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
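A sketch of the common tf × idf formulation (one of several variants; libraries such as scikit-learn apply additional smoothing and normalization):

```python
import math

# Toy corpus of tokenized documents (illustrative)
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "ate", "my", "homework"],
    ["the", "cat", "ate", "the", "fish"],
]

def tf_idf(term, doc, docs):
    # Term frequency: relative frequency of the term in this document
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: penalize terms that occur in many documents
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

print(tf_idf("cat", docs[0], docs))  # moderately informative term
print(tf_idf("the", docs[0], docs))  # appears in every doc -> idf = log(1) = 0
```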
Word embeddings
- Word embedding - Wikipedia
- Word embedding is a way of representing words as vectors. The aim is to map high-dimensional word representations (such as one-hot or count vectors) into low-dimensional dense feature vectors that preserve contextual similarity in the corpus. They are widely used in deep learning models such as Convolutional Neural Networks and Recurrent Neural Networks. A small training sketch follows this list.
- #PAPER Word2Vec: Distributed Representations of Words and Phrases and their Compositionality (Mikolov 2013)
- https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
- Skip-gram model with negative sampling
- https://code.google.com/archive/p/word2vec/
- Paper explained
- http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html
- http://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/
- #PAPER Distributed representations of sentences and documents (Le 2014)
- #PAPER GloVe: Global Vectors for Word Representation (Pennington 2014)
	- GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
- #PAPER sense2vec - A Fast and Accurate Method for Word Sense Disambiguation In Neural Word Embeddings (Trask 2015)
- #PAPER Enriching Word Vectors with Subword Information (Bojanowski 2017)
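A minimal sketch of training skip-gram Word2vec with negative sampling using gensim, then probing the resulting vectors (the toy corpus and hyperparameters are illustrative; real training needs millions of sentences):

```python
from gensim.models import Word2Vec

# Toy tokenized corpus (illustrative only)
sentences = [
    ["king", "rules", "the", "kingdom"],
    ["queen", "rules", "the", "kingdom"],
    ["man", "walks", "in", "the", "city"],
    ["woman", "walks", "in", "the", "city"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # embedding dimensionality
    window=2,         # context window size
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # negative sampling
    min_count=1,
    epochs=50,
)

vec = model.wv["queen"]                       # dense vector for a word
print(model.wv.most_similar("king", topn=3))  # nearest neighbours in vector space

# With a large corpus, analogies like king - man + woman ≈ queen emerge:
# model.wv.most_similar(positive=["king", "woman"], negative=["man"])
```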
Transformer and LLM-based embeddings
- Training and Finetuning Embedding Models with Sentence Transformers v3
- Introduction to Text Embeddings
Code
- #CODE Sentence-transformers - This framework provides an easy method to compute dense vector representations for sentences, paragraphs, and images
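A minimal usage sketch of sentence-transformers (the model name all-MiniLM-L6-v2 is one commonly used pretrained checkpoint, chosen here for illustration):

```python
from sentence_transformers import SentenceTransformer, util

# Load a pretrained sentence embedding model (downloads on first use)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Embeddings map text to dense vectors.",
    "Dense vector representations capture semantic similarity.",
    "The weather is nice today.",
]

embeddings = model.encode(sentences)  # array of shape (3, embedding_dim)

# Cosine similarity between all pairs: the first two sentences score higher
scores = util.cos_sim(embeddings, embeddings)
print(scores)
```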
Courses
References
- #PAPER SGPT: GPT Sentence Embeddings for Semantic Search (2022)
- #PAPER Improving embedding with contrastive fine-tuning on small datasets with expert-augmented scores (2024)
- #PAPER LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders (2024)
- #PAPER #REVIEW Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark (2024)